Packages
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library("tidyverse") # workflow and plots
if (!require("rmdformats")) install.packages("rmdformats")## Loading required package: rmdformats
Background and Goals
Blood was drawn from the postorbital sinus of Blunt-nosed Leopard Lizards (Gambelia sila) between April - July 2021. After centrifuging and separating, plasma was run on a VAPRO vapor pressure osmometer in 1-3 replicates, when plasma volume allowed. In this R script, I check the distribution of replicates, omit outliers, and average remaining replicates. The final values will be more precise and accurate estimates of the true plasma osmolality for each lizard, and those values will be used in the analyses R script file. Please refer to doi: for the published scientific paper and full details.
Load Data
osml_reps <- read.csv("./data/osmolality.csv",
na.strings = c("","NA"),
header = TRUE
) %>%
dplyr::mutate(blood_draw_date = as.Date(blood_draw_date,
format = "%m/%d/%y"),
individual_ID = as.factor(individual_ID),
replicate_no = as.factor(replicate_no),
osmolality_mmol_kg = as.numeric(osmolality_mmol_kg),
hemolyzed_Y_N = as.factor(hemolyzed_Y_N),
)
summary(osml_reps)## blood_draw_date individual_ID replicate_no osmolality_mmol_kg
## Min. :2021-04-23 F-12 : 9 1:116 Min. :306.0
## 1st Qu.:2021-04-24 M-19 : 9 2:107 1st Qu.:350.0
## Median :2021-04-25 M-09 : 8 3: 84 Median :366.0
## Mean :2021-05-09 M-10 : 8 Mean :368.5
## 3rd Qu.:2021-05-08 M-11 : 8 3rd Qu.:382.0
## Max. :2021-07-14 M-20 : 8 Max. :452.0
## (Other):257
## hemolyzed_Y_N
## N :222
## Y : 80
## NA's: 5
##
##
##
##
## [1] "2021-04-23" "2021-04-24" "2021-04-25" "2021-05-07" "2021-05-08"
## [6] "2021-07-14"
Replicates
Now, I will try to identify outliers within the replicates for a given individual on a given date. There must be at least 3 replicates to do this, so the first thing I need to do is figure out which individuals/dates have enough replicates, then subset my data to be only those individuals.
Individuals w 3+ Replicates
# identify individuals with 3-4 reps
enuf_reps <- osml_reps %>%
group_by(individual_ID, blood_draw_date) %>%
mutate(count = n()) %>%
dplyr::filter(count > 2) %>%
arrange(count)
enuf_reps## # A tibble: 252 × 6
## # Groups: individual_ID, blood_draw_date [84]
## blood_draw_date individual_ID replicate_no osmolality_mmol_kg hemolyz…¹ count
## <date> <fct> <fct> <dbl> <fct> <int>
## 1 2021-04-23 M-02 3 382 N 3
## 2 2021-04-23 M-02 2 339 N 3
## 3 2021-04-23 M-02 1 349 N 3
## 4 2021-04-23 M-06 3 346 N 3
## 5 2021-04-23 M-06 2 340 N 3
## 6 2021-04-23 M-06 1 351 N 3
## 7 2021-04-24 M-10 3 391 Y 3
## 8 2021-04-24 M-10 2 417 Y 3
## 9 2021-04-24 M-10 1 424 Y 3
## 10 2021-04-23 F-01 3 337 N 3
## # … with 242 more rows, and abbreviated variable name ¹hemolyzed_Y_N
# identify individuals with 1-2 reps
not_reps <- osml_reps %>%
group_by(individual_ID, blood_draw_date) %>%
mutate(count = n()) %>%
dplyr::filter(count < 3) %>%
arrange(count)
not_reps## # A tibble: 55 × 6
## # Groups: individual_ID, blood_draw_date [32]
## blood_draw_date individual_ID replicate_no osmolality_mmol_kg hemolyz…¹ count
## <date> <fct> <fct> <dbl> <fct> <int>
## 1 2021-04-23 M-04 1 349 N 1
## 2 2021-04-23 M-05 1 348 N 1
## 3 2021-04-23 M-08 1 396 N 1
## 4 2021-04-23 W-002 1 360 <NA> 1
## 5 2021-04-23 W-005 1 361 <NA> 1
## 6 2021-04-24 W-006 1 334 Y 1
## 7 2021-04-24 W-007 1 409 Y 1
## 8 2021-05-08 W-001 1 390 N 1
## 9 2021-05-08 W-017 1 366 N 1
## 10 2021-04-23 M-03 1 372 N 2
## # … with 45 more rows, and abbreviated variable name ¹hemolyzed_Y_N
## [1] 307
## [1] TRUE
Most of the blood samples had enough plasma for 3 reps :)
Assess Variation
We want the Coefficient of Variation (CV) among our technical replicates to be small. We need to calculate it to identify whether there may be outliers.
CVs <- enuf_reps %>%
group_by(individual_ID, blood_draw_date) %>%
summarise(mean = mean(osmolality_mmol_kg),
SD = sd(osmolality_mmol_kg),
CV = (SD/mean) *100,
min = min(osmolality_mmol_kg),
max = max(osmolality_mmol_kg),
range = max - min
)## `summarise()` has grouped output by 'individual_ID'. You can override using the
## `.groups` argument.
## individual_ID blood_draw_date mean SD
## F-12 : 3 Min. :2021-04-23 Min. :309.3 Min. : 0.5774
## M-19 : 3 1st Qu.:2021-04-24 1st Qu.:348.2 1st Qu.: 3.5119
## F-01 : 2 Median :2021-04-25 Median :364.8 Median : 6.3705
## F-03 : 2 Mean :2021-05-09 Mean :366.3 Mean : 9.2897
## F-06 : 2 3rd Qu.:2021-05-08 3rd Qu.:381.1 3rd Qu.:14.5373
## F-10 : 2 Max. :2021-07-14 Max. :437.7 Max. :29.8161
## (Other):70
## CV min max range
## Min. :0.1701 Min. :306.0 Min. :313.0 Min. : 1.00
## 1st Qu.:0.9814 1st Qu.:339.8 1st Qu.:358.0 1st Qu.: 7.00
## Median :1.7355 Median :359.0 Median :372.5 Median :12.00
## Mean :2.5162 Mean :358.1 Mean :375.8 Mean :17.69
## 3rd Qu.:4.1182 3rd Qu.:372.2 3rd Qu.:387.2 3rd Qu.:27.50
## Max. :7.6452 Max. :434.0 Max. :452.0 Max. :58.00
##
Ideally, CV would be <10-15%. If it’s larger, and one of the replicates is very different than the others, we can assume that the replicates that are closer together are more reliable. CV values look really good, which is surprising because the range of values for some groups of replicates is 20-60 mmol/kg, which is very very high. So, I don’t think we will necessarily find statistical outliers.
Find Outliers
# write function to find outliers for each individual on each date
find_outliers <- function(df) {
# initiate dataframe to compile info and list to compile plots
outliers <- data.frame()
#boxplots <- list()
# initiate a for loop to go through every who in df
for(indiv_ch in unique(df$individual_ID)) {
# select data for only the individual of interest
df_sub <- df %>%
dplyr::filter(individual_ID == as.character(indiv_ch))
# make a boxplot
df_sub %>%
ggplot(.) +
geom_boxplot(aes(x = as.factor(blood_draw_date),
y = osmolality_mmol_kg,
fill = as.factor(blood_draw_date))) +
ggtitle(paste("Individual", indiv_ch)) +
theme_classic() -> plot
# print/save
print(plot)
#boxplots[[indiv_ch]] <- plot
# extract outliers
outs <- df_sub %>%
group_by(individual_ID, blood_draw_date) %>%
summarise(outs = boxplot.stats(osmolality_mmol_kg)$out)
# add to running dataframe of outliers
outliers <- outliers %>%
rbind(outs)
}
#return(boxplots)
return(outliers)
}Now apply the function to the data:
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'individual_ID', 'blood_draw_date'. You can
## override using the `.groups` argument.
## # A tibble: 0 × 3
## # Groups: individual_ID, blood_draw_date [0]
## # … with 3 variables: individual_ID <fct>, blood_draw_date <date>, outs <dbl>
As expected, based on boxplots, there are no outliers, but we still want to find and omit the points severely increasing the replicate ranges.
Determine which replicates lead to an increased CV… The osmometer is supposed to have measurement precision within a range of ±3 mmol/kg, so if a single point is >10 mmol/kg different from both other measurements, then we should consider omitting it as an extreme value.
# only do this if the range > 10, otherwise it messes up
high_ranges <- enuf_reps %>%
group_by(individual_ID, blood_draw_date) %>%
dplyr::mutate(mean = mean(osmolality_mmol_kg),
min = min(osmolality_mmol_kg),
max = max(osmolality_mmol_kg),
osml_range = max - min) %>%
dplyr::filter(osml_range > 10) %>%
dplyr::select(individual_ID, blood_draw_date, osmolality_mmol_kg, replicate_no, osml_range)
CV_12 <- high_ranges %>% # get CV for only reps 1&2
dplyr::filter(replicate_no != 3) %>%
group_by(individual_ID, blood_draw_date) %>%
summarise(osml_range = max(osmolality_mmol_kg) - min(osmolality_mmol_kg)
) %>%
mutate(rep_excluded = "3")## `summarise()` has grouped output by 'individual_ID'. You can override using the
## `.groups` argument.
CV_23 <- high_ranges %>% # get CV for only reps 2&3
dplyr::filter(replicate_no != 1) %>%
group_by(individual_ID, blood_draw_date) %>%
summarise(osml_range = max(osmolality_mmol_kg) - min(osmolality_mmol_kg)
) %>%
mutate(rep_excluded = "1")## `summarise()` has grouped output by 'individual_ID'. You can override using the
## `.groups` argument.
CV_31 <- high_ranges %>% # get CV for only reps 3&1
dplyr::filter(replicate_no != 2) %>%
group_by(individual_ID, blood_draw_date) %>%
summarise(osml_range = max(osmolality_mmol_kg) - min(osmolality_mmol_kg)
) %>%
mutate(rep_excluded = "2")## `summarise()` has grouped output by 'individual_ID'. You can override using the
## `.groups` argument.
# figure out what replicate inflates CV (and range)
compare <- high_ranges %>%
mutate(rep_excluded = "none") %>%
rbind(CV_12) %>%
rbind(CV_23) %>%
rbind(CV_31) %>%
group_by(individual_ID, blood_draw_date) %>%
dplyr::mutate(min_range = min(osml_range)) %>%
dplyr::filter(min_range == osml_range) %>%
# reset factor after getting rid of "none's"
mutate(rep_excluded = factor(rep_excluded)) %>%
arrange(individual_ID, blood_draw_date)
compare## # A tibble: 50 × 7
## # Groups: individual_ID, blood_draw_date [49]
## individual_ID blood_draw_date osmolality_mm…¹ repli…² osml_…³ rep_e…⁴ min_r…⁵
## <fct> <date> <dbl> <fct> <dbl> <fct> <dbl>
## 1 F-01 2021-05-08 NA <NA> 1 1 1
## 2 F-02 2021-04-23 NA <NA> 12 1 12
## 3 F-05 2021-04-24 NA <NA> 0 1 0
## 4 F-06 2021-04-24 NA <NA> 19 2 19
## 5 F-06 2021-05-08 NA <NA> 2 2 2
## 6 F-08 2021-04-24 NA <NA> 6 3 6
## 7 F-09 2021-04-24 NA <NA> 13 3 13
## 8 F-10 2021-04-24 NA <NA> 10 2 10
## 9 F-12 2021-07-14 NA <NA> 4 1 4
## 10 F-13 2021-04-23 NA <NA> 4 1 4
## # … with 40 more rows, and abbreviated variable names ¹osmolality_mmol_kg,
## # ²replicate_no, ³osml_range, ⁴rep_excluded, ⁵min_range
## individual_ID blood_draw_date osmolality_mmol_kg replicate_no
## M-03a : 3 Min. :2021-04-23 Min. : NA 1 : 0
## F-06 : 2 1st Qu.:2021-04-24 1st Qu.: NA 2 : 0
## F-17 : 2 Median :2021-04-24 Median : NA 3 : 0
## M-10 : 2 Mean :2021-05-06 Mean :NaN NA's:50
## M-12 : 2 3rd Qu.:2021-05-07 3rd Qu.: NA
## M-14 : 2 Max. :2021-07-14 Max. : NA
## (Other):37 NA's :50
## osml_range rep_excluded min_range
## Min. : 0.00 1:18 Min. : 0.00
## 1st Qu.: 3.00 2:16 1st Qu.: 3.00
## Median : 5.50 3:16 Median : 5.50
## Mean : 6.58 Mean : 6.58
## 3rd Qu.: 9.50 3rd Qu.: 9.50
## Max. :27.00 Max. :27.00
##
Make sure we only found one rep to remove per indiv per date:
## `summarise()` has grouped output by 'individual_ID'. You can override using the
## `.groups` argument.
## # A tibble: 1 × 3
## # Groups: individual_ID [1]
## individual_ID blood_draw_date n
## <fct> <date> <int>
## 1 M-03a 2021-05-07 2
Okay, we just won’t remove any of the replicates for that individual.
Remove Outliers
# need to save these or they get filtered out by default
save <- enuf_reps %>%
left_join(done_compare, by = c("individual_ID", "blood_draw_date")) %>%
dplyr::filter(is.na(rep_excluded) == TRUE)
# remove "outlying" replicates
cleaned <- enuf_reps %>%
left_join(done_compare, by = c("individual_ID", "blood_draw_date")) %>%
dplyr::filter(replicate_no != rep_excluded)
# check number of data obs
nrow(enuf_reps) == (nrow(done_compare) + nrow(cleaned) + nrow(save))## [1] TRUE
Re-Assess Variation
full_cleaned_enuf <- save %>%
rbind(cleaned)
check <- full_cleaned_enuf %>%
group_by(individual_ID, blood_draw_date) %>%
summarise(osmolality_mmol_kg_mean = mean(osmolality_mmol_kg),
rep_SD = sd(osmolality_mmol_kg),
rep_CV = (rep_SD/osmolality_mmol_kg_mean) *100,
rep_min = min(osmolality_mmol_kg),
rep_max = max(osmolality_mmol_kg),
rep_range = rep_max - rep_min
)## `summarise()` has grouped output by 'individual_ID'. You can override using the
## `.groups` argument.
The ranges and CVs are much improved.
Average Remaining Replicates
Now that the outliers are removed from the technical replicates when there were enough replicates to identify them, I will average the remaining replicates and check that it meaningfully improved things.
osml_means <- not_reps %>%
rbind(full_cleaned_enuf) %>%
group_by(individual_ID, blood_draw_date) %>%
summarise(osmolality_mmol_kg_mean = mean(osmolality_mmol_kg))## `summarise()` has grouped output by 'individual_ID'. You can override using the
## `.groups` argument.